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Abstract 


A  compact  scheme  is  a  discretization  scheme  that  is  advantageous  in  obtaining  highly 
accurate  solutions.  However,  the  resulting  systems  from  compact  schemes  are  tridiag¬ 
onal  systems  that  are  difficult  to  solve  efficiently  on  paraUel  computers.  Considering 
the  almost  symmetric  Toeplitz  structure,  a  parallel  algorithm,  simple  parallel  prefix 
(SPP),  is  proposed.  The  SPP  algorithm  requires  less  memory  than  the  conventional 
LU  decomposition  and  is  highly  efficient  on  parallel  machines.  It  consists  of  a  prefix 
communication  pattern  and  AXPY  operations.  Both  the  computation  and  the  commu¬ 
nication  can  be  truncated  without  degrading  the  accuracy  when  the  system  is  diagonally 
dominant.  A  formal  accuracy  study  has  been  conducted  to  provide  a  simple  truncation 
formula.  Experimental  results  have  been  measured  on  a  MasPar  MP-1  SIMD  machine 
and  on  a  Cray  2  vector  machine.  Experimental  results  show  that  the  simple  paral¬ 
lel  prefix  algorithm  is  a  good  algorithm  for  the  compact  scheme  on  high-performance 
computers. 


‘This  research  was  supported  by  the  National  Aeronautics  and  Sp2u:e  Administratio’’  under  NASA  contract  NAS1- 
19t80  while  the  first  author  was  in  residence  at  the  Institute  for  Computer  Applications  in  Science  and  Engineering 
(ICASE),  NASA  Langley  Research  Center,  Hampton,  VA  ‘23681-0001. 
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1  Introduction 


Recent  technological  advances  have  made  it  possible  to  build  computers  that  contain  thousands  of 
processors  and  can  obtain  gigaflops  (10®  floating-point  operations  per  second)  on  real  applications. 
Emerging  parallel  computers  are  designed  to  solve  large  problems  and  achieve  better  accuracy 
than  could  previously  be  obtained  [19].  Parallel  computers  demand  new  models,  new  discretization 
methods,  and  new  algorithms  to  explore  the  potential  of  high-performance  computing. 

Conventionally,  partial  differential  equations  (PDEs)  are  discretized  by  finite-difference  or  finite- 
element  methods,  and  are  solved  by  Gauss-Seidel,  conjugate-gradient,  or  successive  overrelajcation 
(SOR)  methods.  A  new  discretization  method,  the  compact  finite-difference  scheme  {or  compact- 
difference  scheme,  compact  scheme)  was  proposed  by  Kreiss  and  Oliger  [10]  and  was  later  im¬ 
proved  upon  by  Lele  [14].  Compared  with  the  traditional  finite-difference  scheme,  the  compact 
finite-difference  scheme  achieves  higher  accuracy  with  smaller  difference  stencils  and  leads  to  more 
accurate  approximations  because  of  the  smaller  coefficients  on  the  truncation  error.  With  these  ad¬ 
vantages,  the  compact  scheme  has  quickly  gained  popularity.  In  practice,  the  resulting  discretized 
system  of  the  compact  schemes  are  tridiagonal  systems  that  can  be  solved  efficiently  on  sequential 
machines.  However,  tridiagonal  systems  are  difficult  to  solve  efficiently  on  parallel  computers.  For 
example,  to  study  the  physics  of  compressible  homogeneous  turbulence,  the  CDNS  (compressible 
direct  simulation  of  Navier-Stokes)  code,  based  on  a  sixth-order  compact  scheme  and  a  third-order 
time  discretization,  was  implemented  on  the  Intel  Delta  [5].  After  carefully  choosing  an  existing 
tridiagonal  solver,  mapping  the  algorithm  to  the  architecture,  and  overlapping  communication  with 
computation,  the  communication  overhead  consumed  about  30  percent  of  the  total  execution  time. 
Clearly,  more  efficient  algorithms  are  needed  to  explore  the  potential  of  compact  schemes  on  parallel 
computers. 

In  recent  years,  intensive  research  has  been  done  to  develop  efficient  tridiagonal  solvers  on 
parallel  computers.  A  good  survey  can  be  found  in  references  [15],  [7],  and  [12].  Of  the  known 
tridiagonal  solvers,  both  the  recursive-doubling  reduction  method  (RCD)  developed  by  Stone  [17], 
and  the  odd-even,  or  cyclic,  reduction  method  (OER)  developed  by  Hockney  [8]  are  able  to  solve  an 
7i-dimensional  tridiagonal  system  in  0(log(ra))  time  using  n  processors.  These  methods  are  designed 
for  fine-grain  computing.  Substructured  methods  developed  by  Lawrie  and  Sameh  [13],  Wang  [24], 
and  Sun,  Zhang,  and  Ni  [23]  were  designed  for  median-grain  and  coarse-grain  computing  (i.e.,  the 
case  of  p  <  n  or  p  <<  n,  where  p  is  the  number  of  processors  available).  Lawrie  and  Sameh's 
algorithm  is  designed  for  shared-memory  machines;  Wang’s  algorithm  is  designed  for  distributed- 
memory  machines;  and  Sun  et  al.  proposed  three  different  algorithms,  each  of  which  may  be  a  better 
choice  depending  on  the  problem  and  the  machine.  For  compact  schemes,  the  tridiagonal  systems 
have  a  special  rtructurp  that  consists  of  diagonal  dominance  and  are  almost  symmetric  Toeplitz. 
For  this  special  structure,  a  parallel  tridiagonal  solver  for  fine-grain  computing,  the  simple  pamllel 
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prefix  (SPP)  algorithm,  is  proposed  in  this  paper. 

Compared  with  the  popular  tridiagonal  solver  for  fine-grain  computing  known  as  the  cyclic 
reduction  method  [8],  the  SPP  method  is  simple  to  implement  and  computationally  efficient.  It 
requires  only  2-log(n)  AXPY  (vector  plus  scalar  times  vector)  operations  and  prefix  communication 
patterns.  If  the  tridiagonal  system  is  diagonally  dominant,  then  the  AXPY  operations  can  be 
truncated  after  a  certain  number  of  steps  without  degrading  the  accuracy.  A  formal  accuracy 
analysis  is  conducted  and  simple  formulas  are  provided  to  compute  the  number  of  AXPY  operations 
necessary. 

This  paper  is  organized  as  follows.  Section  2  will  present  the  compact  scheme  and  discretized 
tridiagonal  systems.  Section  3  will  introduce  three  versions  of  the  simple  parallel  prefix  algorithm; 
the  SPP  for  tridiagonal  systems  with  the  given  special  structure,  the  SPP  for  solving  symmetric 
Toeplitz  tridiagonal  systems,  and  the  SPP  for  solving  almost  symmetric  Toeplitz  systems.  Accuracy 
analysis  will  be  conducted  in  Section  4.  Section  5  will  give  experimental  results  on  a  16K  processing 
elements  (PEs)  MasPar  SIMD  computer  and  on  a  Cray  2  supercomputer.  Finally,  Section  6  will 
give  the  conclusions. 


Compact  Finite-Difference  Schemes 


With  the  conventional  finite-difference  and  finite-element  discretization  methods,  as  the  order  of 
the  approximation  increases,  the  required  number  of  boundary  and  near-boundary  relations  and 
the  required  number  of  mesh  points  per  derivative  stencil  increases  accordingly.  To  achieve  higher 
accuracy  without  the  use  of  additional  mesh  points,  the  compact  scheme  [10]  was  introduced.  As 
originally  suggested  by  Kreiss  and  Oliger  [10],  and  later  discussed  for  fluid  dynamics  problems  by 
Hirsh  [6],  the  first  and  second  derivatives  for  compact  differences  may  be  approximated  by 

where 

^o/n  =  ~  ~  fn)i 

tlx 

and  hx  is  the  mesh  spacing,  which  is  constant  for  simplicity.  By  multiplying  (1)  by  the  respective 
denominators,  relations  for  the  derivatives  may  be  found,  which  yield 

g/n-l  +  g/n  +  g/n+l  =  ~  /"“l)'  (2) 


2 


and 


(3) 


_L  f "  4.  ^  f "  4. 

12’'"“'  ^  6’'"  ^ 

These  equations  yield  tridiagonaJ  systems  when  the  appropriate  boundary  conditions  are  applied. 

To  make  an  accurate  comparison  between  the  compact-difference  (eqs.  (2)  and  (3))  and  the 
standard  central-difference  schemes,  Taylor-series  expansions  are  employed.  As  Hirsh  has  shown, 
the  truncation  error  for  the  compact  differences  are 

£«)  =  -  and  £(/:)= 

Similar  error  analyses  for  the  central  differences  yield 

=  and 

Although  both  schemes  are  fourth-order  accurate,  the  compact-difference  scheme  should  lead  to 
more  accurate  approximations  as  a  result  of  the  smaller  coefficients  on  the  truncation  error.  Similar 
results  hold  for  other  higher  order  approximations. 

As  yet,  no  mention  has  been  made  about  the  boundary  treatment  for  the  compact  scheme. 
At  the  boundaries,  Hirsh  [6]  experimented  with  a  variety  of  boundary  conditions,  and  Adams  [1] 
suggested  a  boundary  relation  that  includes  near-boundary  derivatives  in  the  formulation.  The 
boundary  conditions  used  by  both  Hirsh  and  Adams  retained  the  tridiagonal  nature  of  the  sys¬ 
tem.  To  demonstrate  the  SPP  algorithm,  fourth-order  one-sided  finite  differences  will  be  used  for 
boundary  conditions. 

Many  relevant  fluid  dynamics  applications  can  make  use  of  high-order  compact-difference  oper¬ 
ators  to  numerically  solve  the  governing  systems  of  equations.  For  example.  Burger’s  equation,  the 
boundary-layer  equations,  and  the  driven  cavity  problem  were  solved  by  Hirsh  [6]  with  compact- 
difference  operators.  Further,  Joslin  et  al.  [9]  used  the  compact-difference  equations  (2)  and  (3)  to 
numerically  solve  the  fuUy  nonlinear  Navier-Stokes  equations  of  fluid  dynamics. 


(4) 


dui  dui  _  \  dp  d^Ui 

dt  ^^^dxj~  pdxi^  ^ dxjdxj' 

where  =  (ui, 02,03)  are  the  velocity  components  that  correspond  to  the  steamv/ise,  normal,  and 
spanwise  directions,  x,  —  (xi,X2,X3);  v  is  the  kinemallc  viscosity,  p  is  the  density,  and  repealed 
indices  infer  a  summation  over  the  index.  To  demonstrate  how  compact- difference  operators  could 
be  employed  to  solve  the  nonlinear  PDF’s  (4)  and  (5),  consider  the  model  problem  of  the  one- 
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dimensional  heat-conduction  equation: 


du  _  d^u 
~di  ~  '^dx^ 


(6) 


To  solve  this  equation  computationally,  discretizations  in  time  and  space  must  be  chosen.  From 
the  Taylor-series  expansions  in  time,  one  derives  the  discrete  equation 


=  u"  -F 

Ox^ 


(7) 


where  n  is  the  number  of  time  levels.  If  the  result  at  level  u”  is  known,  then  the  solution  u""*"’ 
can  be  obtained  if  d^ujdx^  can  be  determined.  The  spatial  derivative  can  be  computed  with  the 
second-derivative  operator  (3).  Each  time-step  advancement  requires  a  single  compact-difference 
solve.  In  the  original  nonlinear  PDE  systems  (4)  and  (5),  both  the  first  and  second  derivative 
operators  are  required;  the  problem  is  multidimensional  and  requires  a  compact-difference  solver 
with  many  right  sides.  Time-iparching  is  necessary  and  requires  compact-difference  solves  at  each 
cime  level.  This  necessitates  a  fast  compact-difference  solver.  By  observation,  equations  (2)  and 
(3)  take  the  matrix  form 

1  0 
1  c  1 


c 

0 


a:  =  d 


(8) 


where  x  =  {/',  /"}  and  c  =  {4, 10}  correspond  to  the  compact- difference  parameters.  The  first  and 
last  rows  of  equation  (8)  arise  from  boundary  conditions.  The  boundary  condition  that  corresponds 
to  can  be  rolled  into  the  system  by  xyy+i  =  djv-n  ^'^d  dj^  =  dj^  —  Thus,  the  system  is 

reduced  to  the  N  x  N  system: 


1  0 
1  c 


c  1 
1  c 


X  =  d 


(9) 


With  higher  orders  of  approximation,  the  resulting  matrix  will  differ  only  in  the  boundary 
conditions.  However,  with  the  appropriate  restructuring  given  above,  the  resulting  tridiagonal 
systems  can  be  written  in  the  almost  symmetric  ToepUtz  form  described  in  the  next  section,  eqn. 
(10),  where  A  is  given  by  (12). 
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3  The  Simple  Parallel  Prefix  Algorithm 

We  are  interested  in  solving  a  tridiagonal  linear  system  of  equations 


Ax  =  d. 


In  this  system  of  equations,  A  is  either  a  symmetric  Toeplitz  tridiagonal  system  of  order  n 


c  1 
1  c  1 


=  [l,c,l], 


or  an  almost  symmetric  Toeplitz  tridiagonal  system  of  order  n 

1 


where  x  =  ,  •  •  • ,  x„)^  and  d  =  (dj  •  •  • ,  dn)^  are  n-dimensional  vectors.  We  assume  that  matrix 

A  is  diagonally  dominant  (i.e.,  |c|  >  2).  Although  we  assume  that  A,  x,  and  d  have  real  coefficients, 
the  extension  to  the  complex  case  is  straightforward. 

3.1  The  Simple  Prefix  Method 

In  this  section,  we  study  how  to  efficiently  solve  the  system 


Ax  =  d 


on  parallel  computers,  where 


1  c 


b 

1 


=  0- [6, 1,0] -[0,1, 6], 


and  a  and  b  are  the  real  solutions  of: 


{o  +  6  =  < 
0-6=1 


Because  a  •  6  =  1  and  |c|  >  2,  we  may  further  assume  that  |a|  >  1  and  |6|  <  1.  By  equation  (13), 


X  =  o-'  •[0,l,6]-*[6,l,0]-*d  =  6-[0,l,6]-^[6,l,0]-V. 
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Let  L  =  [—6,0,0].  Then 


[6, 1,0]  =  [0,1,0] -[-6, 0,0]  =  /-/, 

and 

[6,1,0]-^  =  (/  +  L  +  l2  +  ...+  Z,n-l) 

=  (/  +  Z,2f‘’*”‘'-‘)(/  +  £2f"*’*l-^) . {I  ^  L^){I  +  L^){I  +  L).  (16) 

Note  that  n  is  the  dimension  of  matrix  A  and  that  equations  (15)  and  (16)  can  be  verified  directly. 
The  superscript  of  matrix  L  represents  matrix  multiplication 

/  0 
0  0 

L*  =  (-6)*  0  0 

0 

^  0  0  {-by  0  0 

where  the  first  nonzero  element  (-6)’  is  at  position  (i  +  1, 1).  Similarly,  let  U  =  [0,0,  -6].  Then, 

[0,1,6]=  [0,1,0] -[0, 0,-6]  =  /  -{/, 

and 

[0,1,6]-'  =  (/  +  {/  +  t/2 +  •••+{/"-') 

=  (/+{/"'’“*  "^■’) . {I  +  U^){i+U), 

where 

0  0  (-6)'  0  0  \ 

0 

0  0  (-6)‘  • 

0  0 

0 

The  first  nonzero  element  of  U'  is  at  position  (l,i+  1).  Thus,  the  solution  of  equation  (13)  is 

i  =  6  ■  (/  +  )...(/  +  (/).(/+  ^ . ^ 

Equation  (17)  was  first  used  by  Chung  and  Shen  [2]  in  solving  6-spline  curve  fitting  on  a  Cray 
X/MP  computer. 
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Let  V  =  (vi ,  V2)  •  •  * ,  v„)^  be  an  n- dimensional  vector.  Given  the  special  structure  of  L‘,  we  find 

(/  +  L*)v  =  v  + 


where 

and  vi  is  the  i  +  1  element  of  Similarly,  given  the  special  structure  of  £/*,  we  find 

where 

Equation  (17)  shows  that  equation  (13)  can  be  solved  with  2  •  flognl  AXPY  operations.  Because 
|6|  <  1,  ||Z/*||  -+  0  and  ||I/‘||  —*  0  when  n  — ►  oo,  the  AXPY  operation  may  be  truncated  without 
influencing  the  accuracy.  Formulas  will  be  given  in  Section  4  to  compute  the  smallest  truncation 
integer  k,  A  sequential  Fortran  like  code  for  solving  equation  '17)  within  truncation  error  is  given 
in  Figure  1. 


Do  20  i  =  1, 

Do  20  j  =  n,  2*  +  1,  -1 
dj  =  dj  +  (— fr)*dj_2> 

Do  40  t  =  1,  A 
Do  40  j  =  1,  n  —  2’ 
d'j  =  dj  +  (-h)*dy+2' 

Do  60  j  =  1,  » 

Xj  =  b  •  dj 

Figure  1.  The  simple  prefix  method. 


If  n  processing  elements  are  available  and  dj  is  stored  in  processor  j,  then  the  loop  20  and 
loop  40  will  lead  to  prefix  computations.  Figure  2  shows  the  prefix  computation  pattern  that 
correspond  to  loop  40  when  n  is  equal  to  8.  Loop  20  leads  to  a  similar  prefix  computation  pattern, 
except  that  the  communication  is  from  right  to  left.  Prefix  (or  recursive  doubling)  computation 
is  a  widely  used  computation  model  in  scientific  computing.  Any  linear  recursive  relation  can  be 
computed  by  recursive  doubling  [21].  A  recursive- doubling  algorithm  exists  for  solving  tridiagonal 
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systems  and  involves  matrix-matrix  multiplications  [4,  23]-  Compared  with  the  existing  recursive¬ 
doubling  algorithm,  the  computation  of  our  prefix  method  for  solving  tridiagonal  systems  (13)  is 
simple.  We  call  the  prefix  method  given  in  Figure  1  the  simple  prefix  method.  Figure  3  shows  the 
communication  pattern  of  the  widely  used  cyclic  reduction  method  [8].  In  a  comparison  of  figures 
2  and  3,  we  can  see  that  the  prefix  method  has  a  relatively  simple  communication  pattern. 


Figure  2.  Communication  of  prefix  computation. 


Figure  3.  Communication  of  cyclic  reduction  method. 

3.2  Modification  of  Symmetric  Toeplitz  System 

Our  goal  is  to  find  the  solution  of  equation  (10).  Modification  is  needed  to  convert  the  solution  of 
equation  (13)  to  the  desired  solution.  The  modification  will  be  different  for  symmetric  Toeplitz  sys- 


terns  and  for  almost  symmetric  Toeplitz  systems.  For  a  given  symmetric  Toeplitz  system,  equation 
(13)  is  modified  as 


A  =  A  +  /\A  =  A  +  VE^, 


where 


/  6  0 
0  0 


A/1  = 


0 

0 


\ 


0  0 
0  0/ 


(18) 


and  V  =  (6, 0,  •  • -,0)^  and  E  =  (1,0,  •  • -,0)^  are  n-dimensional  vectors.  By  the  matrix-modification 
formula  [16,  3,  23],  equation  (10)  can  be  s^'lved  by 


x  =  a-u  =  {a-\-ve'^)-h 

X  =  A-^  -  A-^V{I  +  ET A-'^V)-'^ET A-U 
=  i  -  A-^V{I  +  E'^A-'^Vy'^x-i, 


where  x\  is  the  first  element  of  vector  x.  If  the  calculation  approach  given  in  [18]  is  followed,  we 
have 

(/-f  = 


1 


and 


Thus 


1 + Er=i 

i=l  1=1 


n— 2 

X' 

»=i 


A'Wil  +  E'^A-^V)-'' 


_62(n-H)’(  j  _  j,2(n-|-l)  ’  ’  '  '  ’  (  I  _  f,2{n+l)  j  ' 


The  final  solution  is 


X  —  X  —  X\Z, 


(20) 


where  vector  z  is  the  right  side  of  equation  (19).  Because  |6|  <  1,  z  can  be  truncated  at  some 
integer  k\  without  affecting  the  accuracy  (see  section  4).  Furthermore,  when  n  is  large, 
i  =  0, 1,  •  •  •,/:!  will  be  less  than  machine  accuracy,  and  z  reduces  to 


^  =  ((-6)^(-6)^...,(-6)*>+^0,•••,0)^ 
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The  program  of  modification  is  given  in  Figure  4,  and  the  algorithm  for  solving  the  symmetric 
Toeplitz  tridiagonal  system  is  given  in  Figure  5. 


Step  1: 

Use  the  simple  prefix  method  to  find  the  solution  to  equation  (13). 
Step  2; 

Use  the  modification  equation  (20)  to  obtain  the  final  solution. 

Figure  5.  Algorithm  for  solving  symmetric  Toeplitz  system. 


3.3  Modification  of  Almost  Symmetric  Toeplitz  System 

A  similar  modification  can  be  given  to  solve  almost  symmetric  Toeplitz  systems.  For  almost  sym¬ 
metric  Toeplitz  systems,  equation  (13)  is  modified  such  that 


A  =  A  +  AA  =  A  +  VE'^, 

whee 

0-10 
0  0  0 

0  0  0 
0  0 
0 


(21) 
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V  =  (-1,0,  •• -,0)^,  and  E  =  (0, 1,0,  •  • -0)^.  Following  the  matrix  niodification  formula  [23,  18], 
the  solution  to  equation  (10)  becomes 

X  =  A'^d=  A-U- A-^V{I+ E^A-'^Vy^E'^A-U 
=  i  -  A-^V{I  +  E'^A~^V)-'^X2, 

where  X2  is  the  second  element  of  x.  With  this  new  modification. 


(1  + e'^a-^v)-^  = 


a-E-=oi-b)b^'‘ 


«=0  1=0  1=0 

1  _  k2{n-i+l)  1  -  /)2  ^ 

A-'V{I+E^A-'V)-'  ,  (-6)'  ,  ,  •  •  • ,  (-!.)” 


The  final  solution  is 


X  =  X  -  X2y, 


(22) 

(23) 


where  the  vector  y  is  the  right  side  of  equation  (22).  Like  the  symmetric  case,  only  the  addition  of 
the  first  k2  elements  in  equation  (23)  may  be  needed  for  a  given  accuracy  for  some  integer  k2,  and 
|6|2(n-i)  jjg  jggg  macHne  accuracy  for  0  <  i  <  fca  when  n  is  large.  Let 


y  =  (-6,(-6)2,...,(-6)"*,0,..-,0)^. 


(24) 


The  modification  computation  for  the  almost  symmetric  Toeplitz  system  is  given  in  Figure  6.  The 
algorithm  for  solving  the  almost  symmetric  Toeplitz  systems  is  similar  to  the  algorithm  for  solving 
the  symmetric  Toeplitz  system  (see  Figure  5),  except  in  step  2  the  new  modification  (Figure  6)  is 
used  to  replace  the  symmetric  Toeplitz  system  modification. 

Do  100  t  =  1,  1:2 

X  =  Xi-  X2yi 

Figure  6.  Modification  for  almost  symmetric  Toeplitz  system. 
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4  Accuracy  Analysis 

In  a  previous  section,  the  claim  was  made  that  the  AXPY  operations  and  modifications  can  be 
truncated  without  influencing  the  accuracy.  In  this  section,  we  will  study  the  truncation  accuracy. 
For  simplicity,  we  truncate  loop  20  and  loop  40  at  the  same  step  k.  The  norm  used  in  this  section 
is  the  Li-norm. 

4.1  Accuracy  Study  of  the  Simple  Prefix  Method 

Let  x,y  be  exact  solutions  as 

y  = 

=  (I  +  L  +  + - h  L"  ^)b-d, 

X  =  b-[0,l,b]-%l,0]-^-d 

=  (I  +  U  +  U^  +  -  +  U^-'^)-y. 

Let  y*  and  i'  be  the  solutions  given  by  loop  20  and  loop  40,  respectively,  as 

where  A:  is  a  power  of  2,  and 

Also,  we  define  i*  as  a  hypothetical  solution  by 

The  difference  between  x  and  x*  is 

X-i*  =(/+(/+...  +  {/"-J  )j,  -  (7  +  t/  +  .  .  .  +  (/''-l  )y 

=  (f/*' +  )y 

=  ({/*'  +  •••+{/”-')(/ -{/)x 
=  (I/*  -  U^)x. 

Thus,  we  find 

•  ||/  -  <  |6|^1  +  (25) 
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Equation  (25)  gives  the  relative  error  between  x  and  x*.  Because  the  difference  between  x*  and  x' 
is 

x*-x'  =(/+!/  +  ...  +  U‘‘-^)y  -  (/  +  f/  +  •  •  •  + 

=  (/  +  [/  +  •••  +  {/*-» 

=  (/+{/  +  •••  +  )(I^  +  •  •  •  +  )  *  6  •  d 

=  (/  +  (/  +  ...  +  +  •  •  •  +  L”-')(/  -  L){I  -  U)x 

=  (/+•■•  +  -  /,”)(/  -  f/)x, 

the  norm  becomes 

li^^]iiP<\^-w‘-(i  +  ir-‘)(i+i»i).  (26) 

If  equations  (25)  and  (26)  are  combined,  then  the  relative  error  between  the  exact  solution  x  and 
the  truncated  solution  i'  is 

11^  -^'11  <  11^ -^11  . 
l|i||  -  Hill  l|i|l 

<  i4i‘(i  +  i6r‘)(i  +  ^^(1  +  i‘i)) 

<  2-|4|‘-(l  +  |4r‘)- 

4.2  Final  Accuracy  of  Symmetric  Toeplitz  Systems 

The  truncation  error  of  the  simple  prefix  method  (Figure  1)  will  be  carried  into  the  modification 
step  and  wiU  influence  the  accuracy  of  the  final  solution.  Let  x  be  the  solution  of  a  symmetric 
Toeplitz  tridiagonal  system  (10)  and  x*  be  the  corresponding  solution  that  has  been  modified  with 
the  truncated  solution, 


X  =  X  -  XiZ, 

X*  =  i'  -  x\  2, 

where  2  =  6^  , (-6)il^l{n+i{ , •  •  -(-6)""^ i )  !  see  equation  (19).  The  error  gener¬ 

ated  by  this  truncation  is 


II®- *11  =  ll(®-®')  +  (®i -®'i)®ll 

<  l|x-x'||-|-||ii-x;||-||2|| 

<  |(x-x'||(l-|- 1|2||). 

The  norm  of  vector  z  can  be  computed  directly  as 
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Therefore, 


and 


IWI  = 


< 


< 

< 


(i+ifer^Ki-ihn 

1  _  62(n+l)  (1  _  |fr|) 

.2  (1-ir)  .  b^  H 

■(1-|6|"+1)(1-|61)-  1-|6|  H-r 


\b\ 


llx-:c*||  <  |lx-i'||(l  +  i^) 


-  Ilf  f'lir^  ~  Ilf  f'lir  ^  ^ 


\bl 


\\i-i*\\  .  ||x-x1|  l-|fe|  +  16|^ 

INI  -  INI  ^  1-H  ^ 

ll^-xll  INL1-H  +  N\ 

-  ||x|i  ’||x|r  1-I6|  ^ 


(27) 

(28) 

(29) 

(30) 

(31) 


The  only  unknown  in  the  right  side  of  equation  (31)  is  |||||,  where  x  is  the  solution  of  equation  (13) 
and  X  is  the  solution  of  equation  (10).  For  a  symmetric  Toeplitz  system, 


A  =  A  +  AA, 


where  AA  is  given  by  equation  (18).  If  equations  (10)  and  (13)  are  combined,  we  have 

{A  +  Ai4)x  =  Ax, 

(/  +  i4~^  Ai4)x  =  X, 

and 

M<tl/+i-’A^||.  (32) 

After  several  calculations,  we  find 
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and 


Therefore, 


/  \  -b  52 

(  1 

\ 

1  -6 

T 

1 

-6  I 

6 

• 

• 

62  -6  1 

1  -6 

• 

•  • 

1 

1  J 

V  (-6)"-’  • 

-6  1/ 

(  E?=i 

0  •  •  0  > 

0-0 

1 

> 

II 

0-0 

• 

1 

^  (-6)"+^ 

0  -  •  0  ; 

^  (1-Ifc|) 


and 


(i+i6r’)(i-in") 

(i-w) 


The  relative  error  of  the  solution  to  equation  (10)  is 


11^ -^11  .  M  _±_ 

INI  -  ||i||  'INI'I-Iftl 

I6i^(i  +  i6r*=)  /  (1  -  |fe|A^)(i  + 16|)\  /  6^(i  +  ir^^)(i-i6n\ 

1-|6|  1-|6|  (1_62)(i_,6|)  j-W 


When  the  order  of  the  tridiagonal  system  n  is  large,  jh]”  may  be  less  than  machine  accuracy.  In 
this  case,  the  inequality  (34)  becomes 


lk-a:1l  .  mi  +  ir-^)  /,  ,  (1-|6|^)(1  +  |6|)\  / 

lixll  -  1-|6|  1-161  j  +(1-62)(1-|6|) 


(35) 


when  truncation  is  applied  in  the  modification.  In  addition  to  the  truncation  error  carried  from 
step  1,  the  truncated  modification  will  also  generate  truncation  error.  If 


x'  =  x'  —  x\z. 
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(Figures  4  and  (20)),  then 


X  -  x'  =  X  -  x'  —  {xi  -  x[)z  +  x\{z  -  z). 


The  Li-norm  is  given  by 


llx-x'll  < 
< 


l|i  -  i'll  +  II®  -  ®'ll  *  ll®ll  +  ll®ll  •  II®  -  ®ll 
l|i-i'll{l  +  ll®ll)  +  ll®l|-||®-®ll. 


which  leads  to 


In  equation  (36), 


ll®-®'ll  <  ll®-®'ll 
ll®ll  -  l|i|l 


{{f{{(l  +  l|i|l)+lk-i||. 


13^  E  i‘‘(>  - 

t=ki 

3=0 


where  j  =  i  —  ki,  n\  =  n  —  k\.  Then 


ll®-®ll  < 
< 

< 


|6|*'+2  (i  +  161"i+i)(l-161”’) 
l_62(n+i)'  (1-161) 

161''*+2  (i  +  |6|"i+i)(l- 161^1) 

1  _  62(ni+i)  (1  -  161) 

161''«+2 

1-161' 


Note  that  HzH  <  H^H,  so  that  we  have  the  inequality 


(36) 


ll®-®1l 

INI 


< 


< 


(i  +  ll®ll)  + 


161^1+2 


1-161 

(l-16lfe)(l  +  161)\ 

1  -  |6|  i 


1  + _ _ )  +  !^. 


The  error  introduced  by  the  truncated  modification  is  insignificant  if  k\  >  k  -  2.  In  practice,  we 
can  choose  ki  =  k  and  use  the  inequality  (34)  to  compute  the  truncation  number  k. 


4.3  Accuracy  of  Solving  Almost  Symmetric  Toeplitz  Systems 

Following  the  similar  arguments  given  in  last  section,  error  bounds  can  be  obtained  for  almost 
symmetric  Toeplitz  systems.  For  almost  symmetric  systems,  we  have  a  different  modification 
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vector  y  and  a  different  modification  matrix  Av4.  From  equation  (23),  the  exact  solution  to  the 
almost  symmetric  Toeplitz  system  is 


where 


The  norm  of  vector  y  is 


X  =  X~  X2j/, 


”  1  _  /,2(n-i+l) 

IIS'!!  =  EM)'  ,.y.- 


t=l  »•=! 


1  /i6i(i-ir)(i  +  i6i"+^)\  \b\ 

-  (1-|6|)  )-l-\b\‘ 


Let  X*  be  the  solution  modified  with  the  truncated  solution  as 

X*  =  x'  —  Xjy, 

where  x',X2  are  the  same  as  defined  in  Section  4.1.  Then,  following  a  similar  deduction  for  the 
inequality  (29),  we  have 


||x-x*||  <  ||i-x'||{l  + ||i/||), 

11^ -^11  <  l|x-^qi  ||i|U  ,  \b\  . 

11x11  -  11x11  •  11x1^-^1-161^ 

^  llx  -  x^ll  11x11  1 

-  11x11  ■l|xiri-6^- 

The  above  inequality  is  in  the  same  form  as  inequality  (31),  except  that  the  ratio  |||||  is  different 
for  the  almost  symmetric  system.  For  the  almost  symmetric  Toeplitz  system, 


X  =  i4  +  Ay4, 

where  A>4  is  defined  by  equation  (21).  By  direct  calculation,  we  find 

^0  U=o-b>>''  0  •  0\ 


0  (-6)”  0  •  0 
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Therefore, 


n-l 


i=0 


{-by+\\  - 


(1-62) 


|6|(l  +  16r-H)(l-|6n 

(1  -  i^)(l  -  |6|)  ■ 


For  the  almost  symmetric  system, 

M<||/  +  i-'A^||<l  +  |M-'A-4||. 
(See  inequality  (32).)  We  conclude  that 


I|2^-g1l 

l|a:|| 


< 


< 


i6i*(i + |tr-*)  /  (i_-iy*)(uj^\  (  161(1+ |tr')(i-|6r)\ 

i-W  1.  i-|6|  A  (i-i’Ki-TO  ) 

|t|*  {,  .  (iH4h(l+H)'\  (,  ^ 

1  -  |6|  I,  1  -  |6|  A  (‘  -  ‘’Kl  -  l‘l)^  ■ 


(37) 

(38) 


When  compared  with  inequality  (34),  the  numerator  of  the  last  term  in  inequality  (38)  is  slightly 
increased. 

If  the  modification  is  truncated  to  (Figure  6),  then 


_/  _  z,!  zJ  r. 

z  —  X  —  ijy. 


Following  the  derivation  of  inequality  (36),  we  have 

l|2^-  a;'||  .  !|il[ 

IN  -  IN  'IN’ 


(i  +  l|y||)+lli/-y||, 


where 


III/  -  ^1! 


< 

< 

< 


n— 1 


E 


(-6)*+^(l  -62('‘-’)) 
1-62" 


A  g  |V(1  .  J*C-.))| 
t=k2 

|6|*2+^  (1  +  |6|"-*2-1)(1  -  |6|"-''2) 

1  -  62"  (1  -  )6|) 

|6|fc2+i 

1-161' 


(39) 


Thus, 


11^ -^11  <  (.  ,  (i-H^')(i  +  |6|)\  A  ,  |fr|  \ 

||x||  -1-|6|\^^  1-|6|  j  +(i_62)(1-|6|);^  1-|6|- 


(40) 


Similar  to  the  symmetric  case,  the  error  introduced  by  modification  truncation  is  insignificant 
if  k2  >  k  —  1.  We  could  choose  k2  =  k  and  use  inequality  (38)  for  error  estimates. 


5  Experimental  Results 

In  this  section,  the  performance  of  the  SPP  algorithm  on  the  MasPar  parallel  computer  and  on  the 
Cray  vector  computer  wiU  be  presented. 

5.1  Parallel  Computing 

The  MasPar  MP-1  is  a  distributed-memory  massively  parallel  SIMD  computer  with  a  high-speed 
two-dimensional  toroidal  mesh  topology.  A  control  unit  (ACU)  has  a  direct  connection  to  all 
the  processing  elements  (PEs)  and  issues  instructions  at  a  12.5  MHz  clock  rate.  Each  processing 
element  in  the  array  is  a  4-bit  custom  load  and  storage  processor  with  a  minimum  of  16  kilobytes 
of  memory.  Communication  is  relatively  cheap  on  the  MasPar.  For  example,  on  the  MasPar  M-1, 
a  double-precision  multiplication  function  is  ten  times  more  expensive  than  sending  the  product  to 
an  adjacent  PE. 

Table  1  gives  the  computation  and  communication  count  of  the  simple  parallel  prefix  (SPP) 
algorithm.  We  assume  that  both  the  AXPY  operations  and  the  modification  are  truncated  after  k 
operations;  the  modification  vector  used  is  either  equation  (20)  or  (24),  respectively.  The  b^'  ,i  = 
!,•••, A:,  are  computed  sequentiaUy,  or  redundantly,  on  each  PE.  The  modification  vector  z  ox  y 
will  be  computed  concurrently  on  different  PEs.  We  count  a  power-function  computation  as  8 
floating-point  operations.  So,  the  modification  phase  costs  10  parallel  operations  in  total.  The 
calculation  of  6  (equation  (14))  is  not  considered.  In  the  computing  phase,  2  •  log(A:)  parallel,  one- 
to-one  communications  are  required.  We  use  a  to  represent  the  one-to-one  communication.  On 
the  MasPar,  the  one-to-one  communication  is  achieved  by  using  the  router  command.  We  use  f) 
to  represent  the  broadcast.  On  the  MasPar,  the  broadcast  is  achieved  by  transferring  the  local 
data  to  ACU  and  then  distributing  it  to  all  PEs.  One  broadcast  communication  is  needed  in  the 
modification  phase.  Because  the  tridiagonal  systems  that  arise  in  the  compact  scheme  have  multiple 
right  sides,  the  computation  and  communication  count  for  solving  multiple  right-side  systems  is 
also  listed  in  Table  1,  where  the  computation  of  6^'  and  the  modification  vector  are  not  considered. 
Note  that  is  the  number  of  right  sides. 

Two  sample  matrices  are  chosen  to  illustrate  the  performance  of  the  SPP  algorithm  and  to 
verify  the  theoretical  error  bounds  given  in  the  previous  section.  Both  of  the  matrices  are  almost 
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Table  1.  Computation  and  communication  count  of  the  Simple  Prefix  Algorithm 


System 

Best 

sequential 

SPP 

Computation 

Communication 

Single  system 

8n-7 

5-log(A:)  +  10 

2-log(A:)  •  Q  + 

Multiple  right  sides 

(5n  —  3)  +  nl 

(4-log(fc)  +  2)  *  nl 

(2-log(A:)  •  Q  +  /3)  +  nl 

symmetric  Toeplitz  matrices  that  arise  in  the  compact  schemes.  One  matrix  is 

Ai  =  [1,3,1]-  Ai. 


The  other  matrix  is 

A2  =  [1,10,1]-  AA. 

Equation  (18)  defines  A  A.  The  corresponding  solution  of  equation  (14)  is  i»  =  and  b  = 
for  Ai  and  A2,  respectively.  The  error  is  measured  relative  to  the  LU  solution.  The  accuracy 
comparison  for  solving  system  Ai  is  given  in  Figure  7.  In  this  implementation,  no  truncation  is 
implemented  in  the  modification  phase,  and  the  prediction  formula  used  is  equation  (38).  In  solving 
A2,  the  modification  is  applied  at  the  modification  stage  with  ^2  =  and  the  prediction  formula 
used  is  equation  (39).  The  accuracy  comparison  for  solving  system  A2  is  given  in  Figure  8.  From 
Figures  7  and  8,  we  can  see  that  the  theoretical  bound  matches  the  measured  results  well. 


Figure  7.  Measured  and  predicted  accuracy  for  solving  matrix  Aj. 

Figure  9  shows  the  speedup  of  the  SPP  algorithm  over  the  best  sequential  algorithm  for  solving 
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Figure  8.  Measured  and  predicted  accuracy  for  solving  matrix  A2- 

a  single  system.  The  best  sequential  algorithm  is  based  on  the  LU  decomposition,  and  is  fine- 
tuned  to  take  advantage  of  the  almost  symmetric  Toeplitz  structure.  All  computations  are  double 
precision.  Truncation  numbers  fc  =  64  and  fc  =  16  are  chosen  to  achieve  double-precision  (10“'®) 
and  single-precision  (10“^)  accuracy,  respectively.  The  order  of  matrix  is  the  same  as  the  number 
of  PEs  available.  Because  of  the  high-speed  communication,  with  n  equal  to  Ifc,  2k,  4k,  Sk,  and 
16A;,  the  execution  time  is  not  noticeably  changed  in  parallel  processing.  The  sequential  algorithm 
is  implemented  on  a  single  PE.  Because  of  memory  limitations,  only  small  systems  are  solved  by  the 
sequential  algorithm.  The  data  used  in  Figure  9  is  predicted  based  on  the  small-size  timing.  Figure 
10  shows  the  corresponding  speedup  of  solving  a  system  with  1024  right  sides.  The  factorization  of 
the  matrix  is  not  included  in  timing  for  solving  the  system  with  multiple  right  sides.  The  speedup 
is  slightly  higher  for  solving  multiple  right-side  systems. 

Because  the  order  of  the  matrix  increases  linearly  with  the  number  of  PEs  available,  the  speedups 
given  by  Figures  9  and  10  are  memory-bounded  speedup  [20].  From  table  1,  the  problem  size,  in 
terms  of  floating-point  operations,  is  a  linear  function  of  the  order  of  the  matrix.  The  linear  memory- 
bounded  speedups  given  by  Figures  9  and  10  indicate  a  linear  speed  increase.  In  accordance  with 
the  isospeed  metric  of  scalability  [22],  the  SPP  algorithm  is  perfectly  scalable  on  the  SIMD  MasPar 
machine.  Because  the  isoefficiency  metric  [11]  is  equivalent  to  the  isospeed  metric  when  serial 
execution  has  a  fixed  speed  [22],  the  SPP  algorithm  is  also  perfectly  scalable  with  the  isoefficiency 
metric.  The  reader  may  refer  to  [22]  and  [11]  for  more  information  regarding  scalability  of  parallel 
algorithm-machine  combinations. 
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Figure  9.  Speedup  over  the  best  sequential  algorithm  on  single  system. 


Figure  10.  Speedup  over  the  best  sequential  algorithm  on  system  with  1024  right  sides. 
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5.2  Vector  Computing 

Vector  comput’  ig  is  widely  used  at  national  laboratories,  universities,  and  supercomputing  cen¬ 
ters  for  large-  cale  computing  applications.  For  this  reason,  a  CRAY-2S/4-128  at  NASA  Langley 
Research  Center  was  also  used  to  test  the  SPP  algorithm  on  a  vector  machine.  The  Cray-2  no¬ 
tation  “S”  indicates  that  the  memory  is  static  rather  than  dynamic,  and  “4-128”  indicates  that 
the  machine  has  4  processors  and  128  million  64-bit  words  of  central  memory.  Each  CPU  is  a 
register-to-register  vector  processor  with  a  4.1  nsec  minor  cycle  clock  that  can  generate  100-300 
megaflops.  The  four  processors  can  be  used  for  a  single  problem  (multi-tasking)  to  achieve  over  1 
gigaflop  of  performance. 

The  speed  of  a  vector  machine  depends  on  the  vector  length,  vector  stride,  and  the  computa¬ 
tional  richness  of  the  do  loops.  Because  the  vector  register  length  is  64  and  the  CPU  is  extiemely 
fast  in  carrying  out  floating-point  opeiat'ons,  once  operands  are  in  the  registers,  best  performance 
can  be  obtained  with  do-loop  which  have  lengths  that  are  muHiples  of  64,  which  are  computationally 
intensive,  and  which  use  unit  stride  (separation  of  memory  between  elements). 

The  chosen  sample  Iridiagonal  matrices  are  almost  symmetric  Toeplitz  and  correspond  to  the 
first  and  second  derivative  compact-difference  operators  (2)  and  (3).  The  diagonals  and  necessary 
coefficients  are: 


A3  =  [1,4,1] 

with  a,  6  =  2  ±  Vz 

A2  =  [1,10,1] 

with  a,  6  =  5  ±  2\/6 

For  the  first  experiment  on  the  Cray,  single  tridiagonal  solves  were  made.  The  test  problem 
used  here  and  in  the  rest  of  this  section  corresponds  to  f{x)  =  3a:^  —  2i  -f  1,  which  has  smooth 
exact  derivatives  f'{x)  =  9x^  -  2  and  f"{x)  =  18x.  Figure  11  shows  the  performance  of  the 
SPP  in  terms  of  CPU  seconds  and  the  matrix  order  compared  with  the  LU  decomposition  for 
computing  /'  and  /".  In  addition  to  the  reduced  memory  requirements  of  SPP  compaud  to  LU, 
the  performance  shown  in  Figure  1 1  clearly  indicated  that  the  SPP  is  faster  on  the  ve<  tor  machine 
than  tho  conventional  LU  solver;  the  benefits  increase  with  the  operator  size.  The  significant 
difference  between  the  SPP  and  LU  timing  can  be  explained  in  light  of  vector  operations  versus 
scalar  operations.  The  SPP  approach  can  be  vectorized  over  the  direction  of  the  solve;  the  LU 
approach  must  use  scalar  operations.  For  the  SPP  approach,  note  that  the  diagonal  dominance  of 
the  second-derivative  operator  f"  leads  to  faster  computations  compared  with  the  first-derivative 
operator  /'.  This  time  reduction  results  from  the  truncation  of  the  SPP  approach  to  obtain  a 
predetermined  level  of  error  (10“*'*),  which  is  essentially  machine  precision.  For  the  first-derivative 
operator  (/I3),  k  =  32  and  k\  =  24;  for  the  second-derivative  operator  i/li),  k  =  16  aim  kx  =  16, 
where  k  and  kx  are  the  truncation  numbers  on  the  solving  and  modification  pha.ses,  respectively. 

Real  applications  which  use  compact-difference  operators  require  many  tridiagonal  solves  that 


23 


Figure  11.  Timing  of  SPP  and  LU  algorithms;  single  system. 

correspond  to  time-marching  algorithms  and  involve  many  right  sides  corresponding  to  the  mul¬ 
tidimensionality  of  the  application.  In  this  second  evaluation,  with  the  same  accuracy  10"'“*,  the 
performance  of  the  SPP  is  compared  with  LU  for  multiple  right  sides.  Shown  in  Figure  12  are 
CPU  times  for  the  SPP  and  LU  for  various  orders  of  the  second-derivative  compact  operator  A^. 
(Similar  results  were  obtained  with  the  operator  >13  but  are  not  shown.)  For  applications  that 
use  small  operators  (N  <  96),  the  LU  solver  is  more  efficient  than  SPP;  for  applications  that  use 
large  operators  {N  >  96),  the  SPP  is  much  cheaper  than  the  LU  approach.  This  difference  occurs 
because  the  LU  approach  vectorizes  the  do  loop  associated  with  the  number  of  right  sides,  and 
the  SPP  vectorizes  in  the  direction  of  the  tridiagonal  solve.  With  some  creative  programming,  one 
could  potentially  vectorize  the  entire  SPP  approach  with  a  single  array,  while  the  LU  approach  can 
vectorize  over  the  right-side  arrays. 

In  the  final  experiment,  the  ability  of  SPP  to  control  truncation  error  is  demonstrated.  The 
highest  order  of  accuracy  in  the  solution  is  based  on  the  truncation  error  of  the  compact-difference 
approaches  in  equations  (2)  and  (3).  As  a  result,  to  require  machine-zero  is  overkill  for  the  compaci 
solver  and  leads  to  unnecessary  computational  cost.  By  using  the  inequality  (40),  the  choice  of 
truncation  can  be  determined  based  on  a  desired  error  bound.  Figure  13  shows  the  SPP  results 
of  truncations  A:  =  8  and  k  =  32,  which  correspond  to  errors  10“®  and  10“^'’,  respectively.  If  the 
accuracy  of  the  SPP  is  relaxed,  the  computational  cost  decreases  b  c  factor  of  2. 
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6  Conclusion 


A  central  goal  of  parallel  processing  is  to  achieve  better,  more  accurate  solutions.  Because  obtaining 
more  accurate  solutions,  in  general,  means  adding  more  discretization  points,  larger  systems  result 
and  require  greater  computational  power.  The  accuracy  of  a  simulation  solution  is  also  bounded 
by  the  discretization  scheme  used.  A  clear  requirement  for  obtaining  a  more  accurate  solution  is  to 
adopt  discretization  methods  with  high-order  accuracy.  Previously,  a  highly  accurate  discretization 
scheme,  the  compact  finite-difference  scheme,  has  been  proposed.  However,  the  almost  symmetric 
Toeplitz  tridiagonal  systems  that  arise  from  compact  schemes  are  sequential  in  nature  and  difficult 
to  solve  efficiently  on  parallel  computers.  In  this  paper,  we  have  introduced  a  parallel  algorithm, 
the  simple  parallel  prefix  (SPP)  algorithm,  for  compact  schemes. 

The  SPP  algorithm  is  designed  for  fine-grain  computing.  With  n  processors,  the  SPP  algorithm 
solves  an  n-dimensional  system  with  2iog(n)  +  1  AXPY  operations.  Two  prefix  communications 
are  required  in  the  solving  phase  and  one  broadcast  communication  is  required  in  the  modification 
phase.  In  comparison  with  existing  tridiagonal  solvers  [17,  8],  the  SPP  algorithm  is  simple  in 
computing  and  simple  in  communication.  It  requires  storage  of  only  one  log(n)-dimensional  vector 
for  the  computing  phase  and  one  n-dimensional  vector  for  the  modification  phase.  When  the 
tridiagonal  system  is  diagonally  dominant,  both  the  computing  and  the  modification  phases  can 
be  truncated  without  degrading  the  accuracy.  Memory  requirements  wiU  be  further  reduced  when 
truncation  is  applied.  A  detailed  accuracy  analysis  has  been  conducted  to  find  the  appropriate 
truncation  number.  Experimental  results  show  that  the  SPP  algorithm  achieves  a  speedup  greater 
than  1000  over  the  best  sequential  algorithm  on  a  16K  PEs  MasPar  M-1  SIMD  parallel  computer. 
In  addition  to  the  good  performance  on  the  SIMD  machines,  the  SPP  algorithm  also  out  performs 
the  best  sequential  algorithm  on  a  vector  machine  (Cray  2),  even  on  systems  with  multiple  right 
sides.  Experimental  and  theoretical  results  show  that  the  SPP  algorithm  is  a  good  choice  for 
compact  schemes  and  for  the  emerging  high-performance  paraUel  computers. 

The  SPP  algorithm  is  designed  for  almost  symmetric  Toeplitz  tridiagonal  systems.  It  can  be 
modified  for  different  boundary  conditions  and  for  cases  where  the  number  of  processors  p  is  less 
than  the  dimension  of  the  system.  However,  generalization  of  the  algorithm  for  general  tridiagonal 
systems  or  for  band  systems  is  unlikely. 

The  work  presented  in  this  paper  is  a  continuation  of  efforts  to  design  efficient  parallel  solvers 
for  compact  scheme.  An  efficient  solver,  the  PDD  algorithm,  for  coarse-  or  median-grain  computing 
has  been  proposed  [18].  The  PDD  algorithm  and  the  SPP  algorithm  can  be  combined  on  parallel 
machines  with  vector  processing  units. 
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